METHOD FOR AUTOMATIC COMMUNITY MODEL GENERATION BASED ON UNI-PARITY DATA

STATEMENT OF GOVERNMENT INTEREST 

The invention described herein may be manufactured and used by or for the 
Government of the United States for governmental purposes without the payment of any 
royalty thereon. 

BACKGROUND OF THE INVENTION

It can be very useful to know about activities between individuals. For example,
which individuals are associated with one another? Which individuals communicate
with one another? When two or more individuals get together, is there an intended
purpose? Who are the leaders or important individuals of a group? What is the
organizational structure of the group? It is more useful still to have the
capability to model these types of interactions and associations. To an
extent, this type of social research has been addressed by employing the disciplines of
data mining and community generation.

Examples of such problems include mining movie data to find out how
actors/actresses, directors, and producers are linked to different movies and how the
movies are linked to different awards; mining Web community or topic-related
documents to find out where the hubs and authorities or the related documents are and
how they are linked together; mining the commercial merchandise sales data of a
nationwide franchise store to determine the associations (or correlations) among a group
of merchandise items; mining customer search topic data collected over a period of time
in a library to identify a group of related common interests and their relationships; and
mining the traffic data collected from a wide network of geographical locations, nationwide
or within a specific area (e.g., NY City), to find out the traffic accident pattern
correlations among a group of locations. The government or civilian sector also has a
number of requirements for such a capability. Examples include the identification
of terrorist cells, crime rings such as money laundering rings, drug interdiction, and the
identification of tactical units in the battlefield.

In some of these problems the data is given with existing links, such as the movie
data with actor-movie links and the Web data with Web links, while in others the data is
given completely in isolation and no link information is available, such as sales data,
customer search topic data collected from a library, or traffic records collected in
different geographical locations. In the latter case, the goal is to generate communities
based on yet-to-be-determined links between the data items. Current research in community
generation focuses on the former and is addressed under the area of relational data mining
and learning in the literature. But what happens when explicit link or relationship
information is not available? To our knowledge, nobody has systematically addressed
this class of problems, and in fact it has not even been identified as a separate paradigm
within the community generation area, let alone the data mining community. To this
end, we refer to this set of problems as the Uni-party Data Community Generation
(UDCG) problem. To facilitate the comparison, we call the former class of problems
(where the relationships are known or given) Bi-party Data Community Generation
(BDCG) problems.

OBJECTS AND SUMMARY OF THE INVENTION 

It is therefore an object of the present invention to provide a methodology for
solving a uni-party data community generation paradigm.

A further object of the present invention is to provide a method which employs 
automatic community model generation for solving a uni-party data community 
generation paradigm. 

Yet another object of the present invention is to employ Link Discovery based on
Correlation Analysis (LDCA) for generating an automatic community model.

A particular object of the present invention is to provide a method for solving a 
Money Laundering Crime (MLC) case. 

Briefly stated, the present invention provides a method for automatic community
model generation based on uni-parity data. Correlation analysis is employed to identify
links within the community. The method may be particularized for solving specific problems
such as determining the activities within a money laundering ring.

A generalized embodiment of the present invention, a method for automatic
community model generation based on uni-parity data, comprises the steps of
hypothesizing a subset S of a set U, wherein for any pair of items in subset S there exists a
mathematical function C applicable to the pair of items so as to generate a correlation
value and correlation relationship between any pair of items in subset S; generating
correlation values by applying the function C to each of the pairs of items in subset S;
graphing G(S,E), wherein E is the edge set of graph G with computed correlation values
as weights; and mapping graph G to one of its subgraphs $M \subseteq G$ so as to generate a
community.

A further embodiment of the present invention, a method for solving a community
generation problem, comprises the steps of converting documents to digital
form and tagging the digitized documents; parsing the digitized and tagged documents to
extract the transaction history vector for each individual; creating timelines of the
transaction vectors so as to form a timeline map; determining the relevancy of the
vectors; projecting the vectors along a time dimension so as to form a histogram;
translating the vectors into groups of activities by histogram clustering; determining the
local correlation between any pair of clusters in the timelines of two individuals;
computing the global correlations between pairs of individuals; converting data to a graph
as a function of all individuals extracted from the documents and the correlation values
between individuals; generating models based on a search of all subgraphs with
correlation values above a threshold; and outputting a group model.

A particular embodiment of the present invention for solving a money laundering
problem comprises applying the "one way nearest neighbor" principle, wherein the "one
way nearest neighbor" principle further comprises that for every person's name
encountered, the first immediate time instance is the first time instance for a series of
financial activities; the second immediate time instance is the second time instance for
another series of financial activities, etc.; for every time instance encountered, all the
subsequent financial activities are considered as the series of financial activities between
this time instance and the next time instance; financial activities are identified in terms of
money amount; money amount is neutral in terms of deposit or withdrawal; each person's
time sequence of financial activities is updated if new financial activities of this person
are encountered in other places of the same document or in other documents; and the
financial activities of each time instance of a person are updated if new financial activities
of this time instance of the same person are encountered in other places of the same
document or in other documents.

To the accomplishment of the foregoing and related ends, the present invention,
then, comprises the features hereinafter fully described and particularly pointed out in the
claims. The following description and the annexed figures set forth in detail certain
illustrative embodiments of the invention. These embodiments are indicative, however,
of but a few of the various ways in which the principles of the invention may be
employed. Other objects, advantages and novel features of the present invention will
become apparent from the following detailed description of the invention when
considered in conjunction with the figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGURE 1 depicts the primary processes comprising a preferred embodiment of the
present invention.

FIGURE 2 depicts a block diagram process flow chart of an illustrative example of the
preferred embodiment to solve a money laundering crime problem.

FIGURE 3 depicts an event-driven, three-dimensional, nested data structure from the
money laundering crime problem.

FIGURE 4 depicts a timeline map from the three-dimensional, monetary vector money
laundering crime problem.

FIGURE 5 depicts a clustering algorithm based on histogram segmentation from the
money laundering crime problem.

FIGURE 6 depicts an illustration of the algorithm to determine the correlation between
two individuals from the money laundering crime problem.

DETAILED DESCRIPTION OF THE GENERALIZED EMBODIMENT

In this section, we propose a general methodology, called Link Discovery based
on Correlation Analysis (LDCA), as a solution to the general uni-party data community
generation problem. LDCA uses a correlation measure to determine the "similarity" of
patterns between two data items to infer the strength of their linkage. The correlation
measure may be defined in fuzzy logic to accommodate the typical impreciseness of the
"similarity" of patterns.
Referring to FIGURE 1, the components of LDCA as well as the data flow among
these components are depicted. In principle, LDCA consists of three basic steps. For each
problem in the uni-party data community generation paradigm, assume that the data item
set is U. A Link Hypothesis step 100 hypothesizes a subset S of U, such that for any pair
of the items in S there exists a mathematical function (or a procedural algorithm) C that
applies to this pair of items to generate a correlation value in the range of [0, 1], i.e., this
step defines the correlation relationship between any pair of items in S:

$$\forall p, q \in S \subseteq U, \quad C : S \times S \to [0, 1]$$

A Link Generation step 110 then applies the function C to every pair of items in S to
generate the correlation values. This results in a complete graph G(S,E) where E is the
edge set of the graph with computed correlation values as the weights of the edges.
Finally, a Link Identification step 120 defines another function P that maps the complete
graph G to one of its subgraphs $M \subseteq G$ as a generated community.
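
The three LDCA steps can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the implementation of the invention: the
function names `link_generation` and `link_identification`, the toy correlation function,
and the choice of a simple threshold for P are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Hashable, Iterable

Edge = FrozenSet[Hashable]


def link_generation(items: Iterable[Hashable],
                    correlate: Callable[[Hashable, Hashable], float]) -> Dict[Edge, float]:
    """Apply C to every unordered pair in S, giving the weighted edge set E."""
    edges: Dict[Edge, float] = {}
    for p, q in combinations(list(items), 2):
        value = correlate(p, q)
        if not 0.0 <= value <= 1.0:
            raise ValueError("C must map into [0, 1]")
        edges[frozenset((p, q))] = value
    return edges


def link_identification(edges: Dict[Edge, float], threshold: float) -> Dict[Edge, float]:
    """One possible P (an assumption): keep edges whose weight reaches the threshold."""
    return {edge: w for edge, w in edges.items() if w >= threshold}


# Link Hypothesis with toy data: S is a list of integers and C decays with distance.
S = [1, 2, 3, 10]
E = link_generation(S, lambda p, q: 1.0 / (1.0 + abs(p - q)))
M = link_identification(E, threshold=0.5)
print(sorted(tuple(sorted(edge)) for edge in M))
```

Any function C with values in [0, 1] can be passed as `correlate`; only the Link
Hypothesis step is problem specific.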


AN ILLUSTRATIVE EXAMPLE OF THE PREFERRED EMBODIMENT 
MONEY LAUNDERING CRIME 

The Link Discovery based on Correlation Analysis (LDCA) methodology was
applied to solving a specific community generation problem: the identification of
members within a Money Laundering Crime (MLC) group. Specific algorithms are used
in the LDCA process. Such algorithms have been implemented and tested in a prototype
system which the present invention refers to as CORrelation AnaLysis (CORAL).

Preparing the Data

The input data to the MLC model generation problem is based on free text
documents. The data is obtained from varying sources, such as bank statements, financial
transaction records, personal communication letters (including emails), loan/mortgage
documents, as well as other related reports.

Referring to FIGURE 2, the documents are converted 130 to a digital format using
OCR, and key entities (e.g., person names, organization names, financial transaction
times and dates, location addresses, as well as transaction money amounts) are tagged
130 in XML using an extraction tool. No link information is tagged, thereby making
the problem an excellent candidate for applying the LDCA methodology.
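
As an illustration of what such tagged input might look like, the following sketch reads
a small XML fragment and returns the tagged entities in document order. The tag names
(`person`, `org`, `time`, `money`, `location`), the sample text, and the helper
`entity_stream` are hypothetical; the description states only that entities are tagged
using XML and that no links are tagged.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged document; real tag names depend on the extraction tool.
TAGGED = """<doc>
<person>A. Smith</person> on <time>2001-03-02</time> wired <money>9500</money>
to <org>Acme Trading</org>. <person>B. Jones</person> on
<time>2001-03-04</time> deposited <money>9400</money>.
</doc>"""

ENTITY_TAGS = {"person", "org", "time", "money", "location"}


def entity_stream(xml_text):
    """Return (tag, text) pairs for the tagged entities, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, (el.text or "").strip())
            for el in root.iter() if el.tag in ENTITY_TAGS]


for tag, text in entity_stream(TAGGED):
    print(tag, text)
```

The output is a flat sequence of isolated entities, which is exactly the situation in
which links have to be inferred rather than read from the data.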

Once the data set is identified and acquired (i.e., obtained, converted and tagged), it
must be organized into an internal data structure. Due to the nature of the data and
the lack of detailed meta-like data, a number of rules and assumptions are required. The
rules and assumptions to be applied by the present invention are:

• The data set U is the set of all extracted individuals from the collection of the
given documents.

• For each individual, there is a corresponding financial transaction history vector
(which may be null) along the timeline.

• The correlation between two individuals is defined through a correlation function
between the two corresponding financial transaction history vectors.

• If two individuals are in the same MLC group, they should exhibit similar
financial transaction patterns, and thus, should have a higher correlation value.

• Any two individuals may have a correlation value (including 0), i.e., S = U.


Since the present invention has access to only the isolated and tagged entities in the
documents, assumptions must be made to reasonably "guess" the associated relationships
between the extracted time/date stamps and the money amount of a specific transaction
with the extracted individual. Therefore, when the present invention parses 140 the
collection of documents to extract the financial transaction history vectors for every
individual, it follows the "one way nearest neighbor" principle:

• For every person's name encountered, the first immediate time instance is the
first time instance for a series of financial activities; the second immediate
time instance is the second time instance for another series of financial
activities, etc.

• For every time instance encountered, all the subsequent financial activities are
considered as the series of financial activities between this time instance and
the next time instance.

• Financial activities are identified in terms of money amount; money amount is
neutral in terms of deposit or withdrawal.

• Each person's time sequence of financial activities is updated if new financial
activities of this person are encountered in other places of the same document or
in other documents. The financial activities of each time instance of a person are
updated if new financial activities of this time instance of the same person are
encountered in other places of the same document or in other documents.

Based on the rules described above, whenever a new individual's name is
encountered, a new PERSON event is created (see FIGURE 3); whenever a new time
instance is encountered, a new TIME event is created under a PERSON event (see
FIGURE 3); whenever a new financial transaction is encountered, a new
TRANSACTION event is created linked to both corresponding TIME and PERSON
events (see FIGURE 3). All the events are represented as vectors. FIGURE 3 depicts the
data structure created by the present invention.
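
The "one way nearest neighbor" rules and the nested PERSON / TIME / TRANSACTION
structure can be sketched as follows. This is an illustrative reading of the rules, not
the parser of the invention: the token format `(kind, value)`, the function name, and the
use of nested dictionaries in place of event vectors are assumptions.

```python
from collections import defaultdict


def parse_one_way_nearest_neighbor(tokens, history=None):
    """Attach each money amount to the nearest preceding TIME and PERSON."""
    if history is None:
        # person -> time instance -> list of money amounts
        history = defaultdict(lambda: defaultdict(list))
    person = None
    time = None
    for kind, value in tokens:
        if kind == "person":
            person, time = value, None   # PERSON event; wait for its next TIME
            history[person]              # create the entry (history may stay empty)
        elif kind == "time" and person is not None:
            time = value
            history[person][time]        # TIME event under the current PERSON
        elif kind == "money" and person is not None and time is not None:
            # TRANSACTION event; the amount is neutral to deposit or withdrawal
            history[person][time].append(abs(float(value)))
    return history


doc1 = [("person", "A. Smith"), ("time", "2001-03-02"), ("money", "9500"),
        ("money", "-200"), ("time", "2001-03-09"), ("money", "4000"),
        ("person", "B. Jones"), ("time", "2001-03-04"), ("money", "9400")]
doc2 = [("person", "A. Smith"), ("time", "2001-03-02"), ("money", "300")]

history = parse_one_way_nearest_neighbor(doc1)
history = parse_one_way_nearest_neighbor(doc2, history)  # update from another document
for name, timeline in history.items():
    print(name, dict(timeline))
```

Passing the same `history` to a second call implements the update rule: activities found
later for a known person and time instance are appended to the existing record.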

Still referring to FIGURE 2, timelines are created 150 as a result of parsing 140
the entire collection of documents and using the given data structure. Each timeline (see
FIGURE 4) represents the financial transaction history vector of each individual. The
time axis of the timelines is divided into discrete time instances. Each node in the
timelines is called a "monetary vector" that records the part of the financial transaction
history of the corresponding person between the current time instance and the next time
instance.

While the above "one way nearest neighbor" parsing principle may not
necessarily hold in all circumstances, it is believed to be the best choice for the following two
reasons: (1) it gives the best outcome in the absence of the actual association information
in the data; (2) the experimental evaluations show that the generated models based on this
principle are reasonably accurate.

The next part of this step is to determine relevancy 160, i.e., to determine which
monetary vectors are "useful" (they belong to an individual related to the money laundering
case being investigated) and which vectors are just noise (e.g., a "normal" financial
transaction of an individual such as a "normal" purchasing activity, or a false association
between one's monetary activity and someone else due to the one way nearest neighbor parsing
principle). Since the present invention does not know the relevancy of the data, a "guess"
must be made. During the data collection process the investigators typically intend
to collect all the documents that are related to suspects in the case, or those
either suspiciously or routinely related to the case; thus, it is expected that for those
individuals who might be involved in the crimes, the majority of their monetary vectors
should be well clustered into several "zones" in the timeline axis (see FIGURE 4) where
the actual MLCs are committed. This assumption is referred to as the "focus"
assumption. Based on the focus assumption, the present invention needs to pay attention
to only the "clusters" of the monetary vectors in the timeline map, and can ignore those
monetary vectors that are scattered over other places of the timeline map. This allows
maximum filtering of the noise when determining the correlation between two
individuals.

The present invention next projects 170 all the monetary vectors of all the
individuals onto the timeline axis to form a histogram (see FIGURE 5). Consequently,
the clustering problem is reduced to a segmentation problem in the histogram: dividing
the entire timeline into different time zones, called groups of activities 180.

A histogram is generated (see FIGURE 5) from all the monetary vectors along
the timeline. Since the projection and the histogram segmentation may be performed in
linear time in the timeline space, this clustering algorithm significantly reduces the
complexity and avoids the iterative search that a "normal" clustering algorithm such as the
K-means algorithm would typically require. The resulting number of "hills" (i.e., segments)
in the histogram becomes the K clusters or time zones as groups of activities.
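
A minimal sketch of the projection and segmentation steps, assuming that time instances
are integer bin indices and that a "hill" is a maximal run of non-empty bins. The function
names, the `floor` parameter, and the minimum mass of 3 used to discard scattered vectors
are illustrative assumptions, not values given in the description.

```python
def project(timelines, n_bins):
    """Count the monetary vectors of all individuals at each time instance."""
    hist = [0] * n_bins
    for times in timelines.values():
        for t in times:
            hist[t] += 1
    return hist


def segment(hist, floor=0):
    """Return (start, end) pairs of maximal runs with hist[t] > floor, in one pass."""
    segments, start = [], None
    for t, count in enumerate(hist):
        if count > floor and start is None:
            start = t
        elif count <= floor and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(hist) - 1))
    return segments


# Toy timelines: each list holds the time instances of one person's monetary vectors.
timelines = {"x": [1, 2, 2, 8, 9], "y": [2, 3, 9, 9], "z": [2, 5, 9, 10]}
hist = project(timelines, n_bins=12)
hills = segment(hist)
# Focus assumption: keep only hills that carry enough monetary vectors.
clusters = [(s, e) for s, e in hills if sum(hist[s:e + 1]) >= 3]
print(hist, hills, clusters)
```

Both functions make a single pass over their input, which is the linear-time behavior
the text contrasts with iterative clustering.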

Link Hypothesis 

At this point the present invention has formatted the data in a manner in which it
can compute correlation values 200 among pairs of people. After clustering, each
individual's financial transaction history vector may be represented as a timeline
histogram partitioned into K clusters. The K clusters may in turn be represented as K
histogram functions of time t: $\langle f_i(t) \rangle$, where $f_i(t)$ is the financial
transaction histogram of this individual in cluster i. The correlation between two
individuals <x,y> is defined as a combined global correlation of all the local correlations
between the two individuals, whereas the local correlation is defined as the correlation
between two clusters of the timeline histograms of the two individuals.

Global correlation is determined 200 from local correlations between two
individuals x and y (see FIGURE 6). The correlation is defined as this "two level"
function due to the unique nature of the problem, i.e., individuals in the same MLC group
may exhibit similar financial transaction patterns in different time "zones" (which
constrains the local correlation), but the difference in the timeline of their financial
activities should not be too large (which constrains the global correlation). While the
local correlation is defined following a standard approach in Pattern Recognition
literature to determining a fuzzified "similarity" between two functions, the global
correlation is defined based on the unique nature of this problem to further constrain the
overall "similarity" between the financial transaction patterns along the timeline of two
individuals.

In defining a reasonable correlation function, it should be noted that the concept
of similar financial transaction patterns is always fuzzy. That is to say, if two individuals
belong to the same crime group and are involved in the same MLC case, it is unlikely that
they would conduct transactions related to the crime at exactly the same time, nor
is it likely that they would conduct transactions related to the crime at times that are a
year apart. It is more likely that they would conduct the transactions at two different
times close to each other. Consequently, we apply fuzzy logic in both definitions of the
local and global correlations to accommodate the actual "inaccuracy" of the occurrences
in the extracted financial transaction activities between different individuals at different
times.

Local Correlation 

The present invention lets $f_{x_i}(t)$ and $f_{y_j}(t)$ be the financial transaction
histogram functions of individuals x and y in clusters i and j, respectively. Following the
standard practice to define a fuzzified correlation between two functions, it then uses the
Gaussian function as the fuzzy resemblance function within cluster i between time
instances a and b:



where $\sigma$ is defined accordingly based on the specific context in this problem, and
$W_i$ is the width of cluster i.

The Gaussian function is used because it gives a natural decay over the time axis
to represent the fuzzy resemblance between two functions. Consequently, two
transactions of two individuals that occurred at closer times result in more
resemblance than those that occurred at times farther apart. It can be shown that after
applying the fuzzy logic using the Gaussian function as the resemblance function, the
resulting fuzzified histogram is the original one convolved with the fuzzy resemblance
function.



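Thus, determining the local correlation 190 between $f_{x_i}(t)$ and $f_{y_j}(t)$ is defined as
determining the maximum convolution value

$$g(x_i, y_j) = \max_{t} \sum_{t'} \tilde{f}_{x_i}(t')\, \tilde{f}_{y_j}(t - t')$$

where $\tilde{f}_{x_i}$ and $\tilde{f}_{y_j}$ denote the fuzzified histograms.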
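
The local correlation can be sketched as below, reading the text literally: each cluster
histogram is first convolved with a Gaussian resemblance function, and the local
correlation is the largest value of the discrete convolution of the two fuzzified
histograms. The normalized form of the Gaussian, the default `sigma`, and the function
names are assumptions, and the raw value returned here is not rescaled to [0, 1].

```python
import math


def gaussian(a, b, sigma):
    """Normalized Gaussian resemblance between time instances a and b (assumed form)."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def fuzzify(hist, sigma):
    """Convolve a cluster histogram with the Gaussian resemblance function."""
    n = len(hist)
    return [sum(hist[s] * gaussian(t, s, sigma) for s in range(n)) for t in range(n)]


def local_correlation(hist_x, hist_y, sigma=1.0):
    """Maximum value of the discrete convolution of the two fuzzified histograms."""
    fx, fy = fuzzify(hist_x, sigma), fuzzify(hist_y, sigma)
    best = 0.0
    for t in range(len(fx) + len(fy) - 1):
        value = sum(fx[s] * fy[t - s]
                    for s in range(len(fx)) if 0 <= t - s < len(fy))
        best = max(best, value)
    return best


# Toy cluster histograms of two individuals over four time instances.
print(local_correlation([0, 2, 1, 0], [0, 1, 2, 0]))
```

Because convolution is commutative, the sketch gives the same value when the two
histograms are swapped.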



Global Correlation 

The present invention assumes that the timeline axis is clustered into K segments.
Based on the definition of the local correlation 190, for each individual x, at every cluster
i, there is a set of K local correlations with individual y, $\{g(x_i, y_j),\ j = 1, \ldots, K\}$.
It then assigns fuzzy weights to each of the elements of the set based on another Gaussian
function, to accommodate the rationale that financial transactions of the same crime group
should correlate more strongly when they are closer in time than when they are farther
apart in time. Thus, the following series results:

$$\{g(x_i, y_j)\, S(i, j),\ j = 1, \ldots, K\}$$

where

$$S(i, j) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(c_i - c_j)^2}{2\sigma^2}}$$

and $c_i$ and $c_j$ are the centers of cluster i and cluster j along the timeline.

The correlation between individual x in cluster i and the whole financial transaction
histogram of individual y is then defined based on the winner-take-all principle:

$$C(x_i, y) = \max_{j=1}^{K} \{ g(x_i, y_j)\, S(i, j) \}$$

Defining the vectors

$$C_y(x) = \langle C(x_i, y),\ i = 1, \ldots, K \rangle$$

$$C_x(y) = \langle C(y_i, x),\ i = 1, \ldots, K \rangle$$

computing the global correlation 200 between x and y is then defined as computing the dot
product between the two vectors:

$$C(x, y) = C_y(x) \cdot C_x(y)$$
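
The global correlation can be sketched as follows, given a K x K table of local
correlations. The normalized Gaussian used for the cluster weight, the value of `sigma`,
and the assumption that g(y_i, x_j) equals g(x_j, y_i) (so that one table serves both
directions) are illustrative choices, as are the function names.

```python
import math


def cluster_weight(ci, cj, sigma):
    """Gaussian weight between two cluster centers (assumed normalized form)."""
    return math.exp(-((ci - cj) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def global_correlation(g, centers, sigma):
    """g[i][j] holds the local correlation g(x_i, y_j); returns C(x, y)."""
    k = len(centers)
    s = [[cluster_weight(centers[i], centers[j], sigma) for j in range(k)]
         for i in range(k)]
    # Winner-take-all over the clusters of the other individual.
    c_y_of_x = [max(g[i][j] * s[i][j] for j in range(k)) for i in range(k)]
    # Assumes g(y_i, x_j) == g(x_j, y_i), so the transpose of g is reused.
    c_x_of_y = [max(g[j][i] * s[i][j] for j in range(k)) for i in range(k)]
    # Dot product of the two winner-take-all vectors.
    return sum(a * b for a, b in zip(c_y_of_x, c_x_of_y))


centers = [0.0, 10.0]
same_zone = global_correlation([[1.0, 0.0], [0.0, 1.0]], centers, sigma=1.0)
far_apart = global_correlation([[0.0, 1.0], [1.0, 0.0]], centers, sigma=1.0)
print(same_zone, far_apart)
```

In the toy data, activity that matches within the same time zone yields a much larger
global correlation than equally similar activity in zones far apart on the timeline.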

Link Generation 

After applying the correlation function to determine the global correlation 200 for
every pair of individuals in the data set U, the present invention obtains a complete graph
G(V, E) 210, where V is the set of all the individuals extracted from the given collection
of documents, and E is the set of all the correlation values between individuals such
that for any correlation C(x, y), there is a corresponding edge in G with the weight C(x, y)
between the two nodes x and y.

Link Identification

For the problem of MLC group model generation 220, the present invention
defines the function P in Link Identification as a graph segmentation based on a
minimum correlation threshold T. The specific value of T may be obtained based on a
user's expertise (in this example a law enforcement investigator), which allows the user
to validate different models based upon different thresholds and their expertise. Note that
there may be multiple subgraphs M generated based on different values of T, indicating
that there may possibly be multiple MLC groups identified in the given document
collection. It is also possible that the original graph G(V, E) may not necessarily be
connected (the complete graph G may have edges with correlation values 0, resulting in
virtually an incomplete graph). Lastly, the generated models are output 230.
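
One way to realize this threshold-based segmentation is sketched below: edges with a
correlation below T are dropped, and each remaining connected component with at least two
members is reported as a candidate group. Treating connected components as the
"segments", the function name `mlc_groups`, and the toy correlation values are
assumptions made for illustration.

```python
def mlc_groups(individuals, correlation, threshold):
    """Drop edges below T and return connected components with two or more members."""
    adjacency = {v: set() for v in individuals}
    for (x, y), value in correlation.items():
        if value >= threshold:
            adjacency[x].add(y)
            adjacency[y].add(x)
    groups, seen = [], set()
    for start in individuals:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # depth-first traversal of one component
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        if len(component) > 1:
            groups.append(sorted(component))
    return groups


V = ["a", "b", "c", "d", "e"]
C = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "d"): 0.1,
     ("d", "e"): 0.7, ("a", "e"): 0.0}
print(mlc_groups(V, C, threshold=0.5))
print(mlc_groups(V, C, threshold=0.85))
```

Running the function with several thresholds mirrors the way an investigator would
compare the models produced by different values of T.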

While the preferred embodiments have been described and illustrated, it should be
understood that various substitutions, equivalents, adaptations and modifications of the
invention may be made thereto by those skilled in the art without departing from the
spirit and scope of the invention. Accordingly, it is to be understood that the present
invention has been described by way of illustration and not limitation.

What is claimed is: 



